Exploiting Linguistically-Enriched Models for Phrase-Based Statistical Machine Translation

Author

  • Noemie Guthmann
Abstract

This thesis presents the design and implementation of linguistically-informed models for statistical phrase-based machine translation. Using Koehn's Pharaoh (2004), a state-of-the-art SMT system, and Moses (Hoang, 2006), a variant of the former that supports factored translation models, we investigated two approaches: Combined Feature Models and Factored Models. While Combined Feature Models use concatenations of linguistic features to enrich their models, Factored Models view a token as a vector of factors, which makes it possible to build relatively independent models for each factor. In the context of machine translation, both models were expected to enrich the existing surface word model with additional linguistic information. The research focused on improving output translation quality for English-to-French and French-to-English translation from several standpoints. The aim was to make generated documents more readable and understandable, mainly by ensuring fluency in the target language (syntactic correctness), adequacy (use of appropriate terminology) and fidelity (semantic adequacy). These goals were addressed by first analyzing Pharaoh's current performance and understanding the language-specific and model-related problems encountered. Several experiments were then performed using the two approaches, and their results were compared. Despite a few improvements on some of the linguistic issues discussed, notably fixed-expression translation and part-of-speech ambiguity, major problems involving complex syntactic structures in the source language remained a hard challenge for the approach of linguistically augmenting phrase-based statistical machine translation.
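The difference between the two token representations described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the `Token` fields, the underscore separator for the combined feature, and the example tags are assumptions chosen for illustration. The pipe-separated form follows the convention Moses uses for factored corpora.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical annotated token; the choice of factors is an assumption."""
    surface: str
    lemma: str
    pos: str


def combined_feature(tok: Token, sep: str = "_") -> str:
    # Combined Feature Model view: surface form and POS tag are fused into
    # one atomic symbol, so a single model is trained over the enriched token.
    return f"{tok.surface}{sep}{tok.pos}"


def factored(tok: Token, sep: str = "|") -> str:
    # Factored Model view: the token stays a vector of factors, written
    # pipe-separated so each factor can be modeled relatively independently.
    return sep.join(tok)


sentence = [Token("the", "the", "DET"), Token("houses", "house", "NOUN")]

combined_line = " ".join(combined_feature(t) for t in sentence)
factored_line = " ".join(factored(t) for t in sentence)

print(combined_line)
print(factored_line)
```

In the combined form, `houses_NOUN` is an opaque vocabulary item, which can worsen data sparsity; in the factored form, the lemma and part-of-speech columns remain available as separate inputs to their own models.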

Similar resources

Constraining the Phrase-Based, Joint Probability Statistical Translation Model

The Joint Probability Model proposed by Marcu and Wong (2002) provides a probabilistic framework for modeling phrase-based statistical machine translation (SMT). The model’s usefulness is, however, limited by the computational complexity of estimating parameters at the phrase level. We present a method of constraining the search space of the Joint Probability Model based on statistically and li...

Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation

Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by han...

Linguistically Annotated BTG for Statistical Machine Translation

Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this paper, we propose a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated da...

Linguistically Annotated Reordering: Evaluation and Analysis

Linguistic knowledge plays an important role in phrase movement in statistical machine translation. To efficiently incorporate linguistic knowledge into phrase reordering, we propose a new approach: Linguistically Annotated Reordering (LAR). In LAR, we build hard hierarchical skeletons and inject soft linguistic knowledge from source parse trees to nodes of hard skeletons during translation. Th...

Prior Derivation Models For Formally Syntax-Based Translation Using Linguistically Syntactic Parsing and Tree Kernels

This paper presents an improved formally syntax-based SMT model, which is enriched by linguistically syntactic knowledge obtained from statistical constituent parsers. We propose a linguistically-motivated prior derivation model to score hypothesis derivations on top of the baseline model during the translation decoding. Moreover, we devise a fast training algorithm to achieve such improved mod...

Publication date: 2006